This document is my summary explanation of the algorithm in "A Spectral
Algorithm for Learning Hidden Markov Models" (COLT 2009), though
there may be some slight notational inconsistencies with the original paper.
The exposition and the math here are quite different, so if you don't like
this explanation, try the original paper!

The idea is to maintain output predictions in a recursive inference
algorithm, instead of the usual method of maintaining hidden state
predictions, and to represent the HMM only in terms of the maps necessary
to update output predictions given new data. This approach limits the
inference computations the algorithm can perform (it can't answer any
queries about the hidden states since it doesn't explicitly deal with them
at all), but it also reduces the complexity of the model parameters that
are learned and thus makes learning easier. The learning algorithm uses an
SVD and matrix operations, so it avoids the local-optima problems of EM or
any other algorithms based on maximizing data likelihood over the usual
HMM parameterization. The COLT paper includes error bounds and analysis.

Notation. For a vector $v$ in a subspace $\mathcal{V} \subseteq \mathbb{R}^k$
and a matrix $C$ with linearly independent columns and
$\mathrm{range}(C) \supseteq \mathcal{V}$, I will use $[v]^C$ to denote the
coordinate vector of $v$ relative to the ordered basis given by the columns
of $C$, and $[v]$ or simply $v$ to denote the coordinate vector of $v$
relative to the standard basis of $\mathbb{R}^k$. Similarly, for a linear
map $\mathcal{A} : \mathcal{V} \to \mathcal{V}$ I will use $[\mathcal{A}]^C_C$
to denote the matrix of $\mathcal{A}$ relative to domain and codomain bases
given by the columns of $C$, and $[\mathcal{A}]$ to indicate its matrix
relative to standard bases. For a matrix $A$ I will also use $[A]_{ij}$ to
denote the $(i,j)$th entry.




Definition 1 (Hidden Markov Model). A time-homogeneous, discrete Hidden
Markov Model (HMM) is a probability distribution on random variables
$\{(x_t, h_t)\}_{t \in \mathbb{N}}$ satisfying the conditional independences
implied by the graphical model, where
$\mathrm{range}(h_t) = [m] := \{1, 2, \ldots, m\}$ and
$\mathrm{range}(x_t) = [n]$ where $n > m$. The standard parameterization is
the triple $(T, O, \pi)$, where

$$ [T]_{ij} := \Pr[h_{t+1} = i \mid h_t = j], \qquad [O]_{ij} := \Pr[x_t = i \mid h_t = j], \qquad [\pi]_j := \Pr[h_1 = j]. $$

We will assume $T$ and $O$ to have full column rank and
$[\pi]_j > 0 \;\; \forall j \in [m]$.




Figure 1: We can view $O$ as (the matrix of) a map from hidden state beliefs
to output predictions (with respect to the standard bases). Not shown is
the fact that both $\vec{h}_t$ and $\vec{x}_t$ lie in the simplices of
$\mathbb{R}^m$ and $\mathbb{R}^n$, respectively, and that $O$ maps the
simplex in $\mathbb{R}^m$ to (a subset of) the simplex in $\mathbb{R}^n$.



Definition 2 (Observation Prediction). An observation prediction for any
time $t$ is a vector $\vec{x}_t \in \mathbb{R}^n$ defined in the standard
basis by

$$ [\vec{x}_t]_i := \Pr[x_t = i \mid x_{1:t-1} = \bar{x}_{1:t-1}] \qquad (1) $$

for some fixed (implicit) sequence $\bar{x}_{1:t-1}$.



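Claim 1. Every observation prediction $\vec{x}_t$ lies in a subspace
$\mathcal{U} := \mathrm{range}(O) \subseteq \mathbb{R}^n$ with
$\dim(\mathcal{U}) = m$.

Proof. By the conditional independences of the HMM, for any $t$ we have

$$ \Pr[x_t = i \mid x_{1:t-1} = \bar{x}_{1:t-1}] = \sum_j \Pr[x_t = i \mid h_t = j] \cdot \Pr[h_t = j \mid x_{1:t-1} = \bar{x}_{1:t-1}] \qquad (2) $$

so we can write

$$ \vec{x}_t = O \vec{h}_t, \qquad [\vec{h}_t]_j := \Pr[h_t = j \mid x_{1:t-1} = \bar{x}_{1:t-1}]. \qquad (3) $$

Equivalently, we can say $[\vec{x}_t]^O = \vec{h}_t$. See Figure 1. □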



Claim 2. The observation prediction subspace $\mathcal{U}$ satisfies
$\mathcal{U} = \mathrm{range}(P_{2,1})$, where
$[P_{2,1}]_{ij} := \Pr[x_2 = i, x_1 = j]$.

Proof. We can write the joint distribution over $(x_1, x_2)$ as

$$
\begin{aligned}
\Pr[x_2 = i, x_1 = j] &= \sum_{\bar{h}_1} \sum_{\bar{h}_2} \Pr[x_2 = i, x_1 = j, h_1 = \bar{h}_1, h_2 = \bar{h}_2] && (4) \\
&= \sum_{\bar{h}_1} \Pr[h_1 = \bar{h}_1] \Pr[x_1 = j \mid h_1 = \bar{h}_1] \cdot \sum_{\bar{h}_2} \Pr[h_2 = \bar{h}_2 \mid h_1 = \bar{h}_1] \Pr[x_2 = i \mid h_2 = \bar{h}_2] && (5)
\end{aligned}
$$

and we can write that sum as $P_{2,1} = O T \mathrm{diag}(\pi) O^\top$, where
$\mathrm{diag}(\cdot)$ maps a vector to a diagonal matrix in the usual way.
By our rank and positivity assumptions, we see that $P_{2,1}$ satisfies
$\mathrm{range}(P_{2,1}) = \mathrm{range}(O) = \mathcal{U}$. □

We can directly estimate $P_{2,1}$ with empirical statistics, and as a
consequence of Claim 2 we can then get a basis for $\mathcal{U}$ by using
an SVD of $P_{2,1}$. Note that we aren't getting an estimate of $O$ this
way, but just its column space.
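
As a minimal NumPy sketch of this step, the snippet below builds $P_{2,1}$ for a small hand-picked HMM and recovers an orthonormal basis for its range from an SVD. The numbers in `T`, `O`, and `pi` are illustrative assumptions, not values from the paper, and the moment matrix is formed exactly from the parameters rather than from counts, so that the check is free of sampling noise.

```python
import numpy as np

# Hypothetical toy HMM (m = 2 hidden states, n = 3 outputs); every column sums to 1.
T = np.array([[0.9, 0.2],
              [0.1, 0.8]])        # T[i, j] = Pr[h_{t+1} = i | h_t = j]
O = np.array([[0.7, 0.1],
              [0.2, 0.3],
              [0.1, 0.6]])        # O[i, j] = Pr[x_t = i | h_t = j]
pi = np.array([0.6, 0.4])         # pi[j] = Pr[h_1 = j]
n, m = O.shape

# Exact pairwise moment P21[i, j] = Pr[x_2 = i, x_1 = j] = O T diag(pi) O^T.
# With real data this matrix would instead be a normalized table of pair counts.
P21 = O @ T @ np.diag(pi) @ O.T

# The top-m left singular vectors form an orthonormal basis U for range(P21).
left, sing, _ = np.linalg.svd(P21, full_matrices=False)
U = left[:, :m]

# If span(U) = range(O), projecting the columns of O onto span(U) changes nothing.
residual = np.linalg.norm(U @ U.T @ O - O)
print("singular values:", sing)
print("projection residual:", residual)
```

With exact moments the singular values beyond the $m$th are zero up to round-off. With an empirical estimate they are only small, so in practice $m$ has to be supplied or chosen from the singular value spectrum.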

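Claim 3. Given an observation $x_t = \bar{x}_t$, there is a linear map
$\mathcal{B}_{\bar{x}_t} : \mathcal{U} \to \mathcal{U}$ such that

$$ \mathcal{B}_{\bar{x}_t}(\vec{x}_t) = \alpha \vec{x}_{t+1} \qquad (6) $$

for some $\alpha = \alpha(\bar{x}_t, \vec{x}_t)$, a scalar normalization
factor chosen to ensure $\mathbf{1}^\top \vec{x}_{t+1} = 1$.

Proof. Following the usual recursive update for HMM forward messages,
we have

$$ \Pr[h_{t+1} = i, x_{1:t} = \bar{x}_{1:t}] = \sum_j \Pr[h_{t+1} = i \mid h_t = j] \cdot \Pr[x_t = \bar{x}_t \mid h_t = j] \cdot \Pr[h_t = j, x_{1:t-1} = \bar{x}_{1:t-1}]. \qquad (7) $$

Therefore we can write the map $\mathcal{B}_{\bar{x}_t}$ as an $m \times m$
matrix relative to the basis for $\mathcal{U}$ given by the columns of $O$:

$$ [\mathcal{B}_{\bar{x}_t}]^O_O = T \, \mathrm{diag}(O_{\bar{x}_t:}) \qquad (8) $$

where $O_{k:}$ denotes the vector formed by the $k$th row of $O$. We require
$\mathbf{1}^\top \vec{x}_{t+1} = 1$, so we have
$\alpha = O_{\bar{x}_t:}^\top \vec{h}_t$.

See Figure 2. □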



Note. Renormalization always works because $T \, \mathrm{diag}(O_{\bar{x}_t:})$
preserves the non-negative orthant of $\mathbb{R}^m$, and $O$ maps the
simplex in $\mathbb{R}^m$ to (a subset of) the simplex in $\mathbb{R}^n$.
The orthant preservation properties of these maps are an immediate
consequence of the fact that the matrices (with respect to standard bases)
are entry-wise non-negative. In fact, instead of tracking vectors, we
should be tracking rays in the non-negative orthant.







Figure 2: The matrix $T \, \mathrm{diag}(O_{\bar{x}_t:})$ is the matrix
(relative to standard bases) of a linear map that updates hidden state
beliefs given an observation $\bar{x}_t$, up to renormalization which the
figure does not show. The linear map $\mathcal{B}_{\bar{x}_t}$ is the update
for output predictions.

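For any new observation $x$, the map $\mathcal{B}_x$ implements the "belief
update" on output predictions, up to a normalization factor which we can
compute on the fly via $\mathbf{1}^\top \vec{x}_t = 1$. Note that we can
also write $\mathcal{B}_x$ as an $n \times n$ matrix relative to the
standard basis of $\mathbb{R}^n$:

$$ [\mathcal{B}_x] = O T \, \mathrm{diag}(O_{x:}) \, O^\dagger \qquad (9) $$

where $O^\dagger := (O^\top O)^{-1} O^\top$ is the pseudoinverse of $O$.
Recall $\vec{x}_t = O \vec{h}_t$ and hence $\vec{h}_t = O^\dagger \vec{x}_t$.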
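
The following sketch, assuming the same kind of hand-picked toy HMM (the values of `T`, `O`, `pi`, and the observation sequence `obs` are illustrative, and observation symbols are 0-indexed), applies the $n \times n$ matrix from Equation (9) directly to output predictions and compares the result with ordinary forward filtering on hidden-state beliefs.

```python
import numpy as np

# Hypothetical toy HMM; columns of T and O sum to 1.
T = np.array([[0.9, 0.2],
              [0.1, 0.8]])
O = np.array([[0.7, 0.1],
              [0.2, 0.3],
              [0.1, 0.6]])
pi = np.array([0.6, 0.4])
O_pinv = np.linalg.pinv(O)      # equals (O^T O)^{-1} O^T since O has full column rank

def B_std(x):
    """n x n matrix of the update map for observation x, as in Equation (9)."""
    return O @ T @ np.diag(O[x, :]) @ O_pinv

obs = [0, 2, 1, 1]              # arbitrary observed symbols
h = pi.copy()                   # hidden-state belief Pr[h_1]
x_pred = O @ h                  # output prediction for x_1

for x in obs:
    # Reference: the usual forward filtering step on hidden-state beliefs.
    h_unnorm = T @ (O[x, :] * h)
    h = h_unnorm / h_unnorm.sum()
    # Same step carried out on output predictions only.
    x_unnorm = B_std(x) @ x_pred
    x_pred = x_unnorm / x_unnorm.sum()   # enforce 1^T x_{t+1} = 1

max_gap = np.abs(x_pred - O @ h).max()
print("largest difference between the two updates:", max_gap)
```

The loop never reads `h` when updating `x_pred`; the hidden belief is tracked only to provide a reference value for the comparison.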

We would like to write $\mathcal{B}_x$ as a matrix without reference to the
standard HMM parameters $(T, O, \pi)$, since we want to avoid learning them
at all.

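Claim 4. Let $U \in \mathbb{R}^{n \times m}$ be a matrix whose columns form
an orthonormal basis for $\mathcal{U}$. We can write $\mathcal{B}_x$ as a
matrix relative to the standard basis of $\mathbb{R}^n$ as

$$ [\mathcal{B}_x] = P_{3,x,1} U (U^\top P_{2,1} U)^{-1} U^\top \qquad (10) $$

where

$$ [P_{3,x,1}]_{ij} := \Pr[x_3 = i, x_2 = x, x_1 = j] \quad \forall x \in [n]. \qquad (11) $$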

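Proof. We can express the matrix $P_{3,x,1}$ in a form similar to that for
the matrix $[\mathcal{B}_x]$ in Equation (9):

$$ P_{3,x,1} = O T \, \mathrm{diag}(O_{x:}) \, O^\dagger P_{2,1}. \qquad (12) $$

Intuitively, we want to remove the $P_{2,1}$ on the right, since that would
give us $[\mathcal{B}_x]$ in terms of quantities we can readily estimate,
but we cannot form the inverse of $P_{2,1}$ because it is $n \times n$ and
has rank $m < n$. However, $P_{2,1}$ has row and column space $\mathcal{U}$
(intuitively, its restriction to $\mathcal{U}$ is invertible), thus we can
substitute

$$ P_{2,1} = U (U^\top P_{2,1} U) U^\top \qquad (13) $$

to get

$$ P_{3,x,1} = O T \, \mathrm{diag}(O_{x:}) \, O^\dagger U (U^\top P_{2,1} U) U^\top \qquad (14) $$

and hence

$$ P_{3,x,1} U (U^\top P_{2,1} U)^{-1} U^\top = O T \, \mathrm{diag}(O_{x:}) \, O^\dagger = [\mathcal{B}_x], \qquad (15) $$

where the last step uses $U^\top U = I$ and $O^\dagger U U^\top = O^\dagger$,
which holds because the columns of $O$ lie in $\mathcal{U}$.

□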
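
As a numerical sanity check of Claim 4, the sketch below compares the moment-based expression in Equation (10) with the parameter-based expression in Equation (9) on a hand-picked toy HMM. The parameter values are illustrative assumptions, and the moments are computed exactly from the parameters rather than estimated from data.

```python
import numpy as np

# Hypothetical toy HMM; columns of T and O sum to 1.
T = np.array([[0.9, 0.2],
              [0.1, 0.8]])
O = np.array([[0.7, 0.1],
              [0.2, 0.3],
              [0.1, 0.6]])
pi = np.array([0.6, 0.4])
n, m = O.shape

P21 = O @ T @ np.diag(pi) @ O.T                       # Pr[x_2 = i, x_1 = j]
U = np.linalg.svd(P21, full_matrices=False)[0][:, :m]  # orthonormal basis for range(P21)
O_pinv = np.linalg.pinv(O)

max_err = 0.0
for x in range(n):
    # P3x1[i, j] = Pr[x_3 = i, x_2 = x, x_1 = j]
    P3x1 = O @ T @ np.diag(O[x, :]) @ T @ np.diag(pi) @ O.T
    # Equation (10): built from observable moments and U only.
    B_from_moments = P3x1 @ U @ np.linalg.inv(U.T @ P21 @ U) @ U.T
    # Equation (9): built from the hidden parameters, used here only for comparison.
    B_from_params = O @ T @ np.diag(O[x, :]) @ O_pinv
    max_err = max(max_err, np.abs(B_from_moments - B_from_params).max())

print("largest entrywise difference:", max_err)
```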

Because we can estimate each $P_{3,x,1}$ as well as $P_{2,1}$ from data by
empirical statistics, and we can obtain a $U$ using an SVD, we can now
estimate a representation of $\mathcal{B}_x$ from data using the expression
in Claim 4. We can also directly estimate $P_1$ from empirical statistics,
where $[P_1]_i := \Pr[x_1 = i]$, and hence we can use these estimated
quantities to recursively compute $\Pr[x_t \mid x_{1:t-1}]$ and
$\Pr[x_{1:t-1}]$ given observations up to and including time $t-1$.
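
One way the empirical statistics might be gathered is sketched below, under the assumption that the data consist of many independent observation triples $(x_1, x_2, x_3)$. Here the triples are sampled from a hand-picked toy HMM purely for illustration; the helper `sample_from_columns`, the sample size `N`, and the parameter values are all hypothetical.

```python
import numpy as np

# Hypothetical toy HMM used only to generate data; columns of T and O sum to 1.
T = np.array([[0.9, 0.2],
              [0.1, 0.8]])
O = np.array([[0.7, 0.1],
              [0.2, 0.3],
              [0.1, 0.6]])
pi = np.array([0.6, 0.4])
n, m = O.shape
rng = np.random.default_rng(0)
N = 200_000                                   # number of sampled triples

def sample_from_columns(P, cols):
    """For each k, draw an index from the distribution in column P[:, cols[k]]."""
    cdf = np.cumsum(P, axis=0)[:, cols]       # shape (rows, len(cols))
    u = rng.random(cols.shape[0])
    idx = (u[None, :] > cdf).sum(axis=0)      # inverse-CDF sampling
    return np.minimum(idx, P.shape[0] - 1)    # guard against round-off in the CDF

h1 = sample_from_columns(pi[:, None], np.zeros(N, dtype=int))
x1 = sample_from_columns(O, h1)
h2 = sample_from_columns(T, h1)
x2 = sample_from_columns(O, h2)
h3 = sample_from_columns(T, h2)
x3 = sample_from_columns(O, h3)

# Empirical moments: normalized counts of singles, pairs, and triples.
P1_hat = np.bincount(x1, minlength=n) / N
P21_hat = np.zeros((n, n))
np.add.at(P21_hat, (x2, x1), 1.0)             # rows index x_2, columns index x_1
P21_hat /= N
P3x1_hat = np.zeros((n, n, n))                # P3x1_hat[x][i, j] ~ Pr[x_3 = i, x_2 = x, x_1 = j]
np.add.at(P3x1_hat, (x2, x3, x1), 1.0)
P3x1_hat /= N

# Plug the estimates into the expression from Claim 4.
U_hat = np.linalg.svd(P21_hat)[0][:, :m]
core = np.linalg.inv(U_hat.T @ P21_hat @ U_hat)
B_hat = [P3x1_hat[x] @ U_hat @ core @ U_hat.T for x in range(n)]
print("max error in P21 estimate:", np.abs(P21_hat - O @ T @ np.diag(pi) @ O.T).max())
```

The hidden samples `h1`, `h2`, `h3` are discarded after generating the observations; the estimates use only `x1`, `x2`, and `x3`.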

Since $\dim(\mathcal{U}) = m < n$, we can use the columns of $U$ as our
basis for $\mathcal{U}$ to get a more economical coordinate representation
of $\vec{x}_t$ and $\mathcal{B}_x$ than in the standard basis of
$\mathbb{R}^n$:

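Definition 3 (HMM PSR Representation). For any fixed
$U \in \mathbb{R}^{n \times m}$ with $\mathrm{range}(U) = \mathcal{U}$ and
$U^\top U = I_{m \times m}$, we define the belief vector at time $t$ by

$$ b_t := [\vec{x}_t]^U = U^\top \vec{x}_t \in \mathbb{R}^m \qquad (16) $$

and in particular for $t = 1$ we have

$$ b_1 = U^\top O \pi. \qquad (17) $$

For each possible observation $x \in [n]$, we define the matrix
$B_x \in \mathbb{R}^{m \times m}$ by

$$ B_x := [\mathcal{B}_x]^U_U = (U^\top O) \, T \, \mathrm{diag}(O_{x:}) \, (U^\top O)^{-1}. \qquad (18) $$

Finally, for normalization purposes, it is convenient to maintain the
appropriate mapping of the ones (co-)vector, noting
$\mathbf{1}^\top O h = 1$ if and only if $\mathbf{1}^\top h = 1$ because
$\mathbf{1}^\top O = \mathbf{1}^\top$:

$$ b_\infty := [\mathbf{1}]^U = U^\top \mathbf{1} \in \mathbb{R}^m. \qquad (19) $$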
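
As a check on these definitions, the short derivation below sketches why products of the $B_x$ matrices yield sequence probabilities. It uses only Definition 3, the identity $\mathbf{1}^\top O = \mathbf{1}^\top$, and the fact that $U U^\top$ acts as the identity on $\mathcal{U}$ (so $U U^\top O = O$). The shorthand $A_x := T \, \mathrm{diag}(O_{x:})$ is introduced here only for brevity.

```latex
\begin{align*}
b_\infty^\top B_{x_t} \cdots B_{x_1} b_1
  &= \mathbf{1}^\top U \, (U^\top O) \, A_{x_t} \cdots A_{x_1} \, (U^\top O)^{-1} (U^\top O) \, \pi \\
  &= \mathbf{1}^\top U U^\top O \, A_{x_t} \cdots A_{x_1} \, \pi \\
  &= \mathbf{1}^\top O \, A_{x_t} \cdots A_{x_1} \, \pi \\
  &= \mathbf{1}^\top A_{x_t} \cdots A_{x_1} \, \pi
   \;=\; \Pr[x_{1:t}].
\end{align*}
```

The first line holds because the inner factors $(U^\top O)^{-1}(U^\top O)$ between consecutive $B_x$ matrices cancel, and the last equality is the usual forward recursion summed over the final hidden state.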

The box below summarizes the method for learning an HMM PSR representation
from data and how to use an HMM PSR representation to perform some
recursive inference computations.
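Learning

$$
\begin{aligned}
U &= \mathrm{ThinSVD}(P_{2,1}) \\
b_1 &= U^\top P_1 \\
B_x &= U^\top P_{3,x,1} (U^\top P_{2,1})^\dagger \quad \forall x \in [n] \\
b_\infty &= U^\top \mathbf{1}
\end{aligned}
$$

Inference

$$
\begin{aligned}
\Pr[x_{1:t}] &= b_\infty^\top B_{x_{t:1}} b_1 && \text{sequence probability} \\
\Pr[x_t \mid x_{1:t-1}] &= b_\infty^\top B_{x_t} b_t && \text{prediction} \\
b_{t+1} &= \frac{B_{x_t} b_t}{b_\infty^\top B_{x_t} b_t} && \text{recursive update}
\end{aligned}
$$

Here $B_{x_{t:1}}$ denotes the product $B_{x_t} \cdots B_{x_1}$.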
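
The learning and inference steps summarized in the box can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the toy parameters `T`, `O`, `pi` and the sequence `obs` are made up, exact moments stand in for empirical estimates, and the function names are hypothetical. The forward algorithm on the true parameters is included only as a reference.

```python
import numpy as np

# Hypothetical toy HMM; columns of T and O sum to 1.
T = np.array([[0.9, 0.2],
              [0.1, 0.8]])
O = np.array([[0.7, 0.1],
              [0.2, 0.3],
              [0.1, 0.6]])
pi = np.array([0.6, 0.4])
n, m = O.shape

# Exact moments; with data these would be normalized counts.
P1 = O @ pi
P21 = O @ T @ np.diag(pi) @ O.T
P3x1 = [O @ T @ np.diag(O[x, :]) @ T @ np.diag(pi) @ O.T for x in range(n)]

# Learning: the four quantities listed in the box.
U = np.linalg.svd(P21, full_matrices=False)[0][:, :m]
b1 = U.T @ P1
B = [U.T @ P3x1[x] @ np.linalg.pinv(U.T @ P21) for x in range(n)]
binf = U.T @ np.ones(n)

def sequence_probability(obs):
    """Pr[x_{1:t}] = binf^T B_{x_t} ... B_{x_1} b1."""
    b = b1
    for x in obs:
        b = B[x] @ b
    return float(binf @ b)

def conditional_probabilities(obs):
    """Pr[x_t | x_{1:t-1}] for each t, updating the belief vector recursively."""
    b, out = b1, []
    for x in obs:
        unnorm = B[x] @ b
        p = float(binf @ unnorm)   # prediction for the symbol actually observed
        out.append(p)
        b = unnorm / p             # recursive update
    return out

def forward_probability(obs):
    """Reference value from the standard forward algorithm on (T, O, pi)."""
    alpha = pi
    for x in obs:
        alpha = T @ (O[x, :] * alpha)
    return float(alpha.sum())

obs = [0, 2, 1, 1, 0]
p_spec = sequence_probability(obs)
p_fwd = forward_probability(obs)
cond = conditional_probabilities(obs)
print(p_spec, p_fwd, float(np.prod(cond)))
```

The three printed numbers should agree up to round-off, since the product of the one-step conditional probabilities equals the probability of the whole sequence.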